---
title: Trafilatura
description: TrafilaturaTools lets agents scrape web pages and extract clean text and metadata, with support for batch extraction and website crawling.
---

## Example

The following agent can extract and analyze web content:

```python
from agno.agent import Agent
from agno.tools.trafilatura import TrafilaturaTools

agent = Agent(
    instructions=[
        "You are a web content extraction specialist",
        "Extract clean text and structured data from web pages",
        "Provide detailed analysis of web content and metadata",
        "Help with content research and web data collection",
    ],
    tools=[TrafilaturaTools()],
)

agent.print_response("Extract the main content from https://example.com/article", stream=True)
```

## Toolkit Params

| Parameter                 | Type            | Default | Description                                                  |
| ------------------------- | --------------- | ------- | ------------------------------------------------------------ |
| `output_format`           | `str`           | `"txt"` | Default output format (txt, json, xml, markdown, csv, html). |
| `include_comments`        | `bool`          | `False` | Whether to extract comments along with main text.            |
| `include_tables`          | `bool`          | `False` | Whether to include table content.                            |
| `include_images`          | `bool`          | `False` | Whether to include image information (experimental).         |
| `include_formatting`      | `bool`          | `False` | Whether to preserve text formatting.                         |
| `include_links`           | `bool`          | `False` | Whether to preserve links (experimental).                    |
| `with_metadata`           | `bool`          | `False` | Whether to include metadata in extractions.                  |
| `favor_precision`         | `bool`          | `False` | Whether to prefer precision over recall.                     |
| `favor_recall`            | `bool`          | `False` | Whether to prefer recall over precision.                     |
| `target_language`         | `Optional[str]` | `None`  | Target language filter (ISO 639-1 format).                   |
| `deduplicate`             | `bool`          | `True`  | Whether to remove duplicate segments.                        |
| `max_crawl_urls`          | `int`           | `100`   | Maximum number of URLs to crawl per website.                 |
| `max_known_urls`          | `int`           | `1000`  | Maximum number of known URLs during crawling.                |
| `enable_extract_text`     | `bool`          | `True`  | Whether to extract text content.                             |
| `enable_extract_metadata` | `bool`          | `True`  | Whether to extract metadata information.                     |
| `enable_html_to_text`     | `bool`          | `True`  | Whether to convert HTML content to clean text.               |
| `enable_batch_extract`    | `bool`          | `True`  | Whether to extract content from multiple URLs in batch.      |

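As a rough illustration of how these parameters combine, the sketch below configures a toolkit that returns Markdown with tables, links, and metadata. The parameter names come from the table above; the chosen values, the `extraction_options` dict, and the `build_agent` helper are illustrative assumptions rather than part of the library.

```python
# Options drawn from the parameter table above; the values are illustrative.
extraction_options = {
    "output_format": "markdown",  # one of: txt, json, xml, markdown, csv, html
    "include_tables": True,
    "include_links": True,        # marked experimental in the table
    "with_metadata": True,
    "favor_precision": True,      # favor_recall stays at its default (False)
    "target_language": "en",      # ISO 639-1 code
    "max_crawl_urls": 20,
}


def build_agent(options):
    """Hypothetical helper: create an agent whose toolkit uses the given options."""
    # Imported lazily so the options can be inspected without agno installed.
    from agno.agent import Agent
    from agno.tools.trafilatura import TrafilaturaTools

    return Agent(tools=[TrafilaturaTools(**options)])
```

With agno installed and a model configured, `build_agent(extraction_options).print_response("Extract the main content from https://example.com/article", stream=True)` would run the same request as the example at the top of this page with the adjusted settings.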

## Toolkit Functions

| Function              | Description                                                      |
| --------------------- | ---------------------------------------------------------------- |
| `extract_text`        | Extract clean text content from a URL or HTML.                  |
| `extract_metadata`    | Extract metadata information from web pages.                    |
| `html_to_text`        | Convert HTML content to clean text.                             |
| `crawl_website`       | Crawl a website and extract content from multiple pages.        |
| `batch_extract`       | Extract content from multiple URLs in batch.                    |
| `get_page_info`       | Get comprehensive page information including metadata.          |

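The `enable_*` parameters listed under Toolkit Params suggest that individual functions can be switched on or off. The sketch below builds the keyword arguments for a toolkit limited to selected functions. It assumes that each flag name is the function name prefixed with `enable_`, and that setting a flag to `False` withholds that function from the agent; the `only` and `build_toolkit` helpers are hypothetical. The table documents no flags for `crawl_website` or `get_page_info`, so the helper rejects them.

```python
# Flags documented in the parameter table above.
ENABLE_FLAGS = (
    "enable_extract_text",
    "enable_extract_metadata",
    "enable_html_to_text",
    "enable_batch_extract",
)


def only(*functions):
    """Return enable_* keyword arguments that switch on just the named functions."""
    wanted = {f"enable_{name}" for name in functions}
    unknown = wanted - set(ENABLE_FLAGS)
    if unknown:
        raise ValueError(f"No enable flag documented for: {sorted(unknown)}")
    return {flag: flag in wanted for flag in ENABLE_FLAGS}


def build_toolkit(flags):
    """Hypothetical helper: create the toolkit with the given enable_* flags."""
    # Imported lazily so the flag logic can be used without agno installed.
    from agno.tools.trafilatura import TrafilaturaTools

    return TrafilaturaTools(**flags)


# Keyword arguments for a toolkit that only extracts metadata.
metadata_only = only("extract_metadata")
```

Passing `metadata_only` to `build_toolkit` and the result to `Agent(tools=[...])` is one way to keep an agent focused on a single task.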

## Developer Resources

- View [Tools Source](https://github.com/agno-agi/agno/blob/main/libs/agno/agno/tools/trafilatura.py)
- [Trafilatura Documentation](https://trafilatura.readthedocs.io/)
- [Trafilatura Core Functions](https://trafilatura.readthedocs.io/en/latest/corefunctions.html)
